COBE: Contextualized Object Embeddings from Narrated Instructional Video (Supplementary Materials)
Dartmouth College
Our supplementary materials consist of: 1. Implementation Details. As before, the performance of each model variant is evaluated according to the standard mAP detection metric. The ablation studies are conducted on the test set of the HowTo100M_BB dataset. As expected, a larger number of negatives per single positive sample leads to better results.
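The "more negatives per positive" ablation refers to a contrastive (NCE-style) objective, where each query is scored against one matching and K non-matching embeddings. The sketch below is a generic illustration of such a loss, not the paper's implementation; the function name, shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss with one positive and K negatives per query.

    query:     (D,) embedding of the anchor (e.g., an object region)
    positive:  (D,) embedding of the matching sample (e.g., a narration phrase)
    negatives: (K, D) embeddings of non-matching samples
    Returns a scalar loss value.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, pos, negs = norm(query), norm(positive), norm(negatives)
    # Cosine similarities: positive first, then the K negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    # Softmax cross-entropy with the positive at index 0.
    logits -= logits.max()  # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())
```

Increasing K adds more competing terms to the log-sum-exp denominator, which is one intuition for why more negatives per positive tends to help.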
NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding
Zhao, Running, Jiang, Zhihan, Zhang, Xinchen, Chang, Chirui, Chen, Handi, Deng, Weipeng, Jin, Luyao, Qi, Xiaojuan, Qian, Xun, Ngai, Edith C. H.
Users often take notes on instructional videos to access key knowledge later without revisiting long videos. Automated note-generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools neither comprehensively preserve the information conveyed in the original videos nor satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system that automatically converts instructional videos into interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance on objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
Zang, Yuan, Tan, Hao, Yoon, Seunghyun, Dernoncourt, Franck, Gu, Jiuxiang, Kafle, Kushal, Sun, Chen, Bui, Trung
We study multi-modal summarization for instructional videos, whose goal is to provide users with an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. To fill this gap, we propose a novel benchmark for user interface (UI) instructional video summarization. We collect a dataset of 2,413 UI instructional videos spanning over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the need for new methods for UI instructional video summarization.
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements.
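The hierarchical evaluation idea, scoring an assistant separately at the planning, narration, and execution levels so failures can be localized, can be sketched as follows. The field names and aggregation are illustrative assumptions, not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass

@dataclass
class HierarchicalResult:
    """Per-task outcomes at each level of the hierarchy (hypothetical schema)."""
    high_level_plan_ok: bool    # recovered the right subtasks from visuals alone
    mid_level_actions_ok: bool  # produced correct action narrations
    atomic_actions_ok: bool     # executed clicks/inputs accurately

def level_accuracies(results):
    """Aggregate success rate at each level, so one can see where models fail."""
    n = len(results)
    return {
        "high": sum(r.high_level_plan_ok for r in results) / n,
        "mid": sum(r.mid_level_actions_ok for r in results) / n,
        "atomic": sum(r.atomic_actions_ok for r in results) / n,
    }
```

Reporting per-level rates rather than a single end-to-end score is what lets the benchmark pinpoint whether a model fails at planning or at execution.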
Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge
Li, Bin, Liu, Shenxi, Weng, Yixuan, Du, Yue, Tian, Yuhang, Zhou, Shoujun
Following the successful hosting of the 1st CMIVQA challenge (NLPCC 2023, Foshan) and the 2nd MMIVQA challenge (NLPCC 2024, Hangzhou), this year a new task has been introduced to further advance research in multi-modal, multilingual, and multi-hop medical instructional video question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions requiring reasoning over various modalities. This task consists of three tracks: multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single Video (M4TAGSV); multi-modal, multilingual, and multi-hop Video Corpus Retrieval (M4VCR); and multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to develop algorithms capable of processing both video and text data, understanding multilingual queries, and providing relevant answers to multi-hop medical questions. We believe the newly introduced M4IVQA challenge will drive innovations in multimodal reasoning systems for healthcare scenarios, ultimately contributing to smarter emergency response systems and more effective medical education platforms in multilingual communities. Our official website is https://cmivqa.github.io/
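Temporal answer grounding asks a model to predict the (start, end) span of a video that answers a question, and such predictions are commonly scored by temporal intersection-over-union. The function below is a generic illustration of that standard metric, not the challenge's official scoring code.

```python
def temporal_iou(pred, gold):
    """Temporal IoU between two (start, end) spans, e.g. in seconds.

    pred, gold: (start, end) tuples with start <= end.
    Returns a value in [0, 1]; 1.0 means the spans coincide exactly.
    """
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))       # overlap length
    union = (pe - ps) + (ge - gs) - inter             # combined length
    return inter / union if union > 0 else 0.0
```

A typical evaluation then reports the fraction of questions whose predicted span reaches an IoU threshold (e.g. 0.3, 0.5, 0.7) against the annotated answer span.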
Review for NeurIPS paper: COBE: Contextualized Object Embeddings from Narrated Instructional Video
While this algorithm is specifically designed for detectors, Miech et al. (2019) used unsupervised NCE losses (much like the ones in this paper) to understand the natural language descriptions associated with videos; the algorithm presented here seems like the most straightforward extension of this idea to bounding boxes. Little attention is given to demonstrating that the use of bounding boxes fundamentally changes the problem. Update: The rebuttal addresses my concern regarding the accuracy of the evaluation. I had misunderstood the annotations that are available with EPIC-Kitchens, and therefore I am changing my review. I would encourage the authors to clarify the writing regarding what is available with EPIC-Kitchens.
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state---such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
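The core idea, estimating how people tend to move between keysteps and then using those transition probabilities to regularize per-clip predictions, can be sketched in a few lines. This is a minimal illustration of the concept, assuming keysteps are given as label sequences; it is not the paper's actual method, and the blending weight is a hypothetical parameter.

```python
from collections import defaultdict

def build_task_graph(sequences):
    """Estimate keystep transition probabilities P(next | current) from
    observed keystep sequences (lists of keystep labels)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def rescore(prev_step, step_scores, graph, alpha=0.5):
    """Blend per-clip recognition scores with graph transition priors,
    so predictions consistent with typical task structure are favored."""
    return {s: (1 - alpha) * p + alpha * graph.get(prev_step, {}).get(s, 0.0)
            for s, p in step_scores.items()}
```

With a graph mined from many how-to videos, a keystep that the classifier scores slightly lower can still win if it is the far more likely successor of the previous step.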